Introduction

This document explores the qualitative indicators in an ActivityInfo database used for monitoring in Ecuador.

Indicator count totals
Nov 2013 to May 2019
Date        Quantity     Select      Single-line text   Multi-line text   % of total data collected
Nov 2013    141,442      30,531      0                  6,309             3.54%
June 2015   1,887,857    745,841     85,863             57,128            2.06%
Sept 2016   3,380,991    1,296,548   191,640            116,184           2.33%
May 2017    4,932,977    1,809,419   265,196            168,599           2.35%
May 2019    12,174,327   7,595,829   2,683,945          915,948           3.92%



From the perspective of ActivityInfo, the table shows a clear need for new tools to support the analysis of qualitative data: between Nov 2013 and May 2019, the absolute volume of multi-line text grew by a factor of about 145 (from 6,309 to 915,948 values), and its share of all data collected almost doubled from its low of 2.06% in June 2015 to 3.92%.
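The growth factor and the percentage column can be re-derived from the counts in the table. The following is a minimal sketch in Python, not part of the project code; the variable and function names are illustrative, and it assumes the last column is the multi-line text count divided by the sum of the four count columns.

```python
# Counts copied from the table above:
# (quantity, select, single-line text, multi-line text)
counts = {
    "Nov 2013": (141_442, 30_531, 0, 6_309),
    "June 2015": (1_887_857, 745_841, 85_863, 57_128),
    "May 2019": (12_174_327, 7_595_829, 2_683_945, 915_948),
}


def multiline_share(row):
    """Share of multi-line text values among all values collected."""
    return row[3] / sum(row)


# Percentage share per date, rounded like the table.
shares = {date: round(100 * multiline_share(row), 2) for date, row in counts.items()}

# Growth of the absolute multi-line text volume between the first and last rows.
growth = counts["May 2019"][3] / counts["Nov 2013"][3]

print(shares)
print(round(growth, 1))
```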


Data preparation

The data has been extracted from ActivityInfo and pre-processed to make it ready for analysis. See the scripts in the R/ directory of the project repository for the details of this process.

Data import & preparation

Read the data that has been extracted, cleaned, and transformed. Select the rows where the field type equals NARRATIVE, which indicates a multi-line text field in ActivityInfo. Then analyze the selected records by comparing and contrasting them with the other fields associated with the textual fields.
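The selection step can be sketched as follows. This is an illustration in Python rather than the project's own R code; the column name `fieldType` and the three rows are hypothetical, and only the value `NARRATIVE` comes from the description above.

```python
import csv
import io

# Toy stand-in for the extracted data. The column name "fieldType" and the
# rows are hypothetical; only the value "NARRATIVE" comes from the text.
raw = io.StringIO(
    "fieldType,partnerName,response\n"
    "NARRATIVE,Partner A,Texto largo de ejemplo\n"
    "QUANTITY,Partner A,12\n"
    "NARRATIVE,Partner B,Otro texto\n"
)

rows = list(csv.DictReader(raw))

# Keep only the multi-line text (narrative) records.
narrative = [row for row in rows if row["fieldType"] == "NARRATIVE"]

print(len(rows), len(narrative))
```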

partnerName n prop freq
ACNUR 747 0.604 60%
NRC 134 0.108 11%
PMA 88 0.071 7%
UNICEF 67 0.054 5%
OIM 62 0.050 5%
UNFPA 58 0.047 5%
CARE 27 0.022 2%
Dialogo Diverso 19 0.015 2%
RET 12 0.010 1%
ADRA 11 0.009 1%
Plan Internacional 5 0.004 0%
PNUD 3 0.002 0%
UNESCO 3 0.002 0%

The table above shows the number of records per partner:

  • ACNUR accounts for most of the records, with 60%.

  • NRC comes second with 11%, about 50 percentage points behind ACNUR (0.604 versus 0.108).
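The `prop` and `freq` columns follow directly from the counts in `n`. A minimal sketch in Python, assuming `prop` is the count divided by the total and `freq` is the same value as a rounded percentage:

```python
# Record counts per partner, copied from the table above.
n = {
    "ACNUR": 747, "NRC": 134, "PMA": 88, "UNICEF": 67, "OIM": 62,
    "UNFPA": 58, "CARE": 27, "Dialogo Diverso": 19, "RET": 12,
    "ADRA": 11, "Plan Internacional": 5, "PNUD": 3, "UNESCO": 3,
}

total = sum(n.values())

# One row per partner: count, proportion (3 decimals), percentage label.
table = {
    partner: {
        "n": count,
        "prop": round(count / total, 3),
        "freq": f"{round(100 * count / total)}%",
    }
    for partner, count in n.items()
}

for partner, row in table.items():
    print(partner, row["n"], row["prop"], row["freq"])
```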

subPartnerName n prop freq
HIAS 584 0.472 47%
ACNUR 226 0.183 18%
NRC 137 0.111 11%
OIM 63 0.051 5%
UNFPA 57 0.046 5%
ADRA 36 0.029 3%
CARE 27 0.022 2%
Dialogo Diverso 18 0.015 1%
UNICEF 17 0.014 1%
RET 13 0.011 1%
Plan Internacional 5 0.004 0%
SJR 5 0.004 0%
Alas de Colibri 4 0.003 0%
Buen Pastor 4 0.003 0%
Casa Matilde 4 0.003 0%
Fundación de Mujeres de Sucumbios 4 0.003 0%
Fundación Tarabita 4 0.003 0%
Hermanas Salesias 4 0.003 0%
Hogar de Cristo 4 0.003 0%
Pastoral Social Cáritas Tulcán 4 0.003 0%
Patronato 4 0.003 0%
World Vision 4 0.003 0%
PNUD 3 0.002 0%
UNESCO 3 0.002 0%
JRS Ecuador 2 0.002 0%

The table above shows the number of records per sub-partner.

  • HIAS reports the most, with 47% of the records.

  • The remaining sub-partners contribute few records overall; however, they may account for a large share of the reporting under their own partners. The next plot shows this.

The plots placed in the tabs below show the proportion of records entered by sub-partners and partners.

  • 518 of ACNUR's 747 responses actually come from HIAS.

  • UNICEF has a more diverse set of reporting sub-partners: 44% of its responses come from HIAS and 25% come from UNICEF itself.

  • PMA has the most diverse set of sub-partners: 13 of them report under it, and HIAS accounts for 40% of its records.

These are the total numbers of records, not specific to the narrative fields. In the next section, we count the records entered in the narrative fields.
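The within-partner shares quoted in the list above are counts of (partner, sub-partner) pairs divided by the partner's total. A minimal sketch in Python on made-up pairs; the labels `P1`, `S1`, and so on are placeholders, not real partners:

```python
from collections import Counter, defaultdict

# Hypothetical (partner, sub-partner) pairs, one per record; not real data.
records = [
    ("P1", "S1"), ("P1", "S1"), ("P1", "S2"),
    ("P2", "S1"), ("P2", "P2"), ("P2", "S3"), ("P2", "S3"),
]

pair_counts = Counter(records)
partner_totals = Counter(partner for partner, _ in records)

# Share of each sub-partner within its partner's records.
sub_shares = defaultdict(dict)
for (partner, sub), count in pair_counts.items():
    sub_shares[partner][sub] = count / partner_totals[partner]

# Number of distinct sub-partners reporting under each partner.
diversity = {partner: len(subs) for partner, subs in sub_shares.items()}

print(dict(sub_shares))
print(diversity)
```

Applied to the one pair of real figures given in the text, the same division gives 518 / 747, or roughly 69% of ACNUR's records coming from HIAS.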

Which partners and sub-partners are reporting?

[Tabbed plots, one tab per partner: ACNUR, ADRA, CARE, Dialogo Diverso, NRC, OIM, Plan Internacional, PMA, PNUD, RET, UNESCO, UNFPA, UNICEF]

Narrative data

The number of partners and sub-partners recording narrative data

Not all partners (and sub-partners) enter narrative records.


The number of cantons and provinces recording narrative data

The table is alphabetically ordered.

province                        canton               n   province.prop  province.freq
AZUAY                           CUENCA               24  0.046          4%
BOLIVAR                         SAN MIGUEL           1   0.002          0%
CARCHI                          TULCAN               87  0.166          16%
CHIMBORAZO                      RIOBAMBA             2   0.004          0%
COTOPAXI                        LATACUNGA            3   0.006          0%
EL ORO                          HUAQUILLAS           40  0.076          7%
EL ORO                          MACHALA              7   0.013          1%
ESMERALDAS                      ESMERALDAS           30  0.057          5%
ESMERALDAS                      SAN LORENZO          24  0.046          4%
GUAYAS                          GUAYAQUIL            46  0.088          8%
IMBABURA                        IBARRA               55  0.105          10%
LOS RIOS                        QUEVEDO              3   0.006          0%
MANABI                          MANTA                2   0.004          0%
PICHINCHA                       QUITO                99  0.189          18%
SANTO DOMINGO DE LOS TSACHILAS  SANTO DOMINGO        27  0.051          5%
SUCUMBIOS                       LAGO AGRIO           71  0.135          13%
TUNGURAHUA                      AMBATO               2   0.004          0%
TUNGURAHUA                      BAÑOS DE AGUA SANTA  2   0.004          0%

Treemap plot showing canton and province reporting frequencies.

Analysis

Label forms recode table

First of all, we recode the form topics to shorter names, because the original labels are too long and clutter the plots. The recode table below provides a lookup between the form labels and their abbreviations:

i labelFormsRecode labelForms
1 Salud Salud
2 Alojamiento Alojamiento Temporal
3 Necesidades Necesidades básicas/Otro
4 Población Manejo de la información y entrega directa de la información a la población
5 Socios Manejo de la información para socios y análisis de las necesidades
6 VBG Protección_VBG
7 Tráfico Trata_y_tráfico
8 Educación Acceso_a_educación
9 Hábitat Acceso a vivienda y hábitat dignos en comunidades receptoras
10 Técnico Medios de vida y formación técnico-profesional
11 SocialCohesión Cohesión_social
12 Educacional Apoyo Educacional a Comunidades Receptoras
13 VBG_SSR Asistencia técnica para VBG-SSR
14 Fronteras Asistencia técnica para protección/gestión de fronteras
15 Coordinacion Asistencia técnica para gestion de la informacion y coordinacion
16 SectorLaboral Asistencia técnica para el sector laboral
17 Protección Asistencia técnica para protección
18 ProtecciónInfancia Asistencia técnica para protección de la infancia
19 LGBTI Protección_LGBTI
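A lookup like this can be held as a plain mapping from the original label to its short form. A minimal sketch in Python, with the pairs copied from the table above; the function name `short_label` is hypothetical, and falling back to the original label for unknown forms is an assumption of this sketch:

```python
# Original form label -> short label, copied from the recode table above.
RECODE = {
    "Salud": "Salud",
    "Alojamiento Temporal": "Alojamiento",
    "Necesidades básicas/Otro": "Necesidades",
    "Manejo de la información y entrega directa de la información a la población": "Población",
    "Manejo de la información para socios y análisis de las necesidades": "Socios",
    "Protección_VBG": "VBG",
    "Trata_y_tráfico": "Tráfico",
    "Acceso_a_educación": "Educación",
    "Acceso a vivienda y hábitat dignos en comunidades receptoras": "Hábitat",
    "Medios de vida y formación técnico-profesional": "Técnico",
    "Cohesión_social": "SocialCohesión",
    "Apoyo Educacional a Comunidades Receptoras": "Educacional",
    "Asistencia técnica para VBG-SSR": "VBG_SSR",
    "Asistencia técnica para protección/gestión de fronteras": "Fronteras",
    "Asistencia técnica para gestion de la informacion y coordinacion": "Coordinacion",
    "Asistencia técnica para el sector laboral": "SectorLaboral",
    "Asistencia técnica para protección": "Protección",
    "Asistencia técnica para protección de la infancia": "ProtecciónInfancia",
    "Protección_LGBTI": "LGBTI",
}


def short_label(label):
    """Return the short form of a label, or the label itself if unknown."""
    return RECODE.get(label, label)


print(short_label("Protección_VBG"))
print(short_label("Alojamiento Temporal"))
```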

Response quality

Response quality here means how much text the questions receive in response. The idea is to find the factors related to response quality and to understand under which conditions the questions work well.

Research questions:

  • What is the quality of textual responses in the narrative fields?

  • Is there any relationship between the word counts of response, question and description fields?

  • How is the response word count distributed across explanatory variables such as the question, form topic, canton name, and partner name?

Assumptions:

  • Responses with a larger word count are of higher quality than responses with a smaller word count.

In other words, we assume that the more words, the better. This assumption is limited by the unequal distribution of the data: the word count of a response can depend on other things. For example, questions that call for short answers naturally receive shorter responses.

Additionally, we can cross-check these outcomes qualitatively. It might be a good idea to take a small subset of the data and ask an expert to test the assumptions. For instance, we can take the twenty responses with the highest word count and the twenty with the lowest. We choose the extremes because they show the greatest differences, which makes the assumptions easier to test.
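Drawing the two extreme groups for expert review can be sketched as follows. This Python example is illustrative only: the function names are hypothetical, the responses are toy strings, and the word count is assumed to be a simple whitespace split.

```python
def word_count(text):
    """Number of whitespace-separated tokens in a response."""
    return len(text.split())


def extremes(responses, k=20):
    """Return the k longest and the k shortest responses by word count."""
    ranked = sorted(responses, key=word_count, reverse=True)
    k = min(k, len(ranked) // 2)  # avoid overlap between groups on small inputs
    longest = ranked[:k]
    shortest = ranked[len(ranked) - k:]
    return longest, shortest


# Toy responses, not taken from the data.
responses = ["uno", "uno dos tres", "uno dos", "uno dos tres cuatro"]
longest, shortest = extremes(responses, k=1)

print(longest)
print(shortest)
```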

Word count

One issue with the questions is that their names are unique only within a form. The same question name can appear in multiple forms, where it carries a different meaning. For instance, the question “Cualitativo” in the form “Salud” means something different from the question “Cualitativo” in the form “Protección_VBG”.

There are two ways to solve this problem:

  • We can combine the question with its form label and its folder label. This gives each question a unique name.

  • Alternatively, we can move the analysis up to the form level. In this file, we did both, and the analysis is shown below:
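Both approaches can be sketched in a few lines of Python. The folder label `Carpeta A`, the separator, and the function name are hypothetical; the form and question names come from the example above.

```python
from collections import Counter


def question_key(folder, form, question, sep=" / "):
    """Combine folder, form, and question labels into one unique name."""
    return sep.join([folder, form, question])


# Hypothetical records: (folder label, form label, question name).
records = [
    ("Carpeta A", "Salud", "Cualitativo"),
    ("Carpeta A", "Protección_VBG", "Cualitativo"),
    ("Carpeta A", "Salud", "Cualitativo"),
]

# Approach 1: a unique name per question.
keys = {question_key(*record) for record in records}

# Approach 2: move the analysis up to the form level.
per_form = Counter(form for _, form, _ in records)

print(sorted(keys))
print(per_form)
```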

Count of responses per topic/question:

labelForms  question     response     .responseWordCount  .questionWordCount  partnerName  canton       description  labelFormsRecode
Salud       Cualitativ…  1. Entrega…  302                 1                   UNFPA        MACHALA      Descripció…  Salud
Salud       Cualitativ…  1. Entrega…  302                 1                   UNFPA        LAGO AGRIO   Descripció…  Salud
Salud       Cualitativ…  1. Entrega…  302                 1                   UNFPA        HUAQUILLAS   Descripció…  Salud
Salud       Cualitativ…  233 Equipo…  46                  1                   UNFPA        SAN LORENZO  Descripció…  Salud
Salud       Cualitativ…  1. Entrega…  302                 1                   UNFPA        TULCAN       Descripció…  Salud
Salud       Cualitativ…  Se complem…  13                  1                   UNFPA        LAGO AGRIO   Descripció…  Salud

It is also good practice to look at the number of observations behind each box. For example, a question with only two responses produces a very short box. For this reason, jittered points are added to the plot to give a sense of the number of observations.

Figure: Box plot form topics and response word counts based on the raw data


In the plot above, the outliers are shown in orange. Outliers are the points that fall beyond the whiskers (the long lines extending from the box) of the box plot[^1].
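The text does not say how far the whiskers extend. A common convention, assumed here, is Tukey's rule: points more than 1.5 times the interquartile range below the first quartile or above the third quartile count as outliers. A minimal sketch in Python on toy word counts; note that the quartile method used by a plotting library may differ slightly from the one below.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule, assumed)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Toy response word counts, not taken from the data.
word_counts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

print(find_outliers(word_counts))
```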

The response word count distribution per form topic categorized by partner name:

The response word count distribution per form topic categorized by canton name:

A caveat: reducing many values to a single value should be avoided in the early stages of the analysis, because the reduction hides a lot. An example is a bar chart showing the average word count per partner. A partner may have a higher average than others because:

  1. It actually writes longer responses than other partners.

  2. The other partners answered questions that require only short answers.
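The caveat can be illustrated with two made-up partners whose average word counts are identical although their responses look nothing alike. A minimal sketch in Python; the partner names and numbers are invented for the illustration:

```python
from statistics import mean, median

# Made-up response word counts per partner; not real data.
word_counts = {
    "Partner A": [50, 50, 50, 50],
    "Partner B": [2, 3, 5, 190],
}

# A single average hides the difference; median and range reveal it.
summary = {
    partner: {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }
    for partner, values in word_counts.items()
}

for partner, stats in summary.items():
    print(partner, stats)
```

A bar chart of the means would draw two bars of equal height here, while a box plot with jittered points would show that one partner wrote a single very long response.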

The “Description” field

Some questions have a description field that gives extra detail about the question.

Do questions with a description field receive better-quality responses than questions without one?

Looking at the table containing form name, question, description and so on:

The plot below shows the response word counts per form, colored by whether the question has a description field. A question counts as having a description when its description field contains at least one word.
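The rule for the indicator is simple enough to state as code. A minimal sketch in Python, assuming words are separated by whitespace and that a missing description is represented as `None` or an empty string; the function name is hypothetical:

```python
def has_description(description):
    """True when the description field contains at least one word."""
    return description is not None and len(description.split()) >= 1


print(has_description("Descripción de la actividad"))
print(has_description("   "))
```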

The responses with the highest word counts belong to questions with a description. Nevertheless, the plot does not show a clear correlation between response word count and the presence of a description field. Interestingly, the questions in the form F15, which is Protección_VBG, have no description fields at all.

Below, we compare the description word count with the response word count, dropping the categorical field that indicates whether the question has a description.

TODO ANOVA

Correlation

TODO

The regression line

We can look at multiple continuous variables in our data.

  • word count of response field: the dependent variable.

  • word count of question field: an independent variable.

  • word count of description field: an independent variable.

Scatter plots help us understand the characteristics of those variables. However, they do not show the overall trend, which a trend line provides.

The gray area around the lines shows the confidence band at the 0.95 level. Although the linear regression line has a clear slope, we cannot say that the trend is robust, because the confidence band, which represents the uncertainty in the estimate, is wide.
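The trend line and its uncertainty can be sketched with ordinary least squares. The Python example below is an illustration on toy numbers, not the fit behind the plot: it returns the slope, the intercept, and the standard error of the slope. A large standard error relative to the slope corresponds to a wide band; the 0.95 band itself would also need a t quantile, which is left out here.

```python
import math


def fit_line(x, y):
    """Least-squares fit y = intercept + slope * x, with the slope's standard error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual standard error with n - 2 degrees of freedom.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2)) / math.sqrt(sxx)
    return slope, intercept, se_slope


# Toy data: x could be a question word count, y a response word count.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept, se_slope = fit_line(x, y)
print(slope, intercept, se_slope)
```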

Logistic regression

TODO

Text analysis

In this section, we treat text as data.

References

Silge J, Robinson D (2017). Text mining with R: A tidy approach. O’Reilly Media, Inc.

 

The QualMiner project explores the qualitative data used in the Venezuelan refugee response by applying text analysis and mining techniques. The project is funded by the UNHCR Innovation Fund.